Topic models are easy to train, but do they generate useful topics? In this post, we discuss how to calculate several diagnostic metrics that Mallet uses to assess topic quality and conduct a principal component analysis (PCA) to determine which underlying features are most important. Since many of the evaluation metrics are highly correlated, PCA is an appropriate analytical approach: it re-expresses highly correlated multivariate data as uncorrelated components, each capturing an independent piece of the information in the larger data set.
To accomplish this, we use Mallet to generate fifty topics for a corpus of over 264K posts found on publicly available Facebook pages related to COVID-19 and fifty topics for a corpus of ~11 million Twitter posts related to COVID-19. We used hashtag pooling to generate topics for the Twitter corpus. We use Python to calculate diagnostic measures from Mallet topic-term frequency output files.
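As a concrete sketch of that step: Mallet can write per-topic diagnostics to an XML file (via its `--diagnostics-file` option), and the metrics can be pulled into Python with the standard library. The XML below is a small synthetic stand-in, and the attribute names follow Mallet's diagnostics documentation; check your Mallet version's actual output, since names can differ.

```python
import xml.etree.ElementTree as ET

# Tiny synthetic stand-in for Mallet's --diagnostics-file XML output.
# Attribute names follow Mallet's topic-diagnostics documentation.
DIAGNOSTICS_XML = """<model>
  <topic id="0" tokens="5124.0" coherence="-112.3" exclusivity="0.41"
         corpus_dist="0.83" uniform_dist="2.91" eff_num_words="151.2"
         word-length="5.6"/>
  <topic id="1" tokens="1420.0" coherence="-240.8" exclusivity="0.77"
         corpus_dist="1.92" uniform_dist="2.10" eff_num_words="310.5"
         word-length="6.1"/>
</model>"""

METRICS = ["tokens", "coherence", "exclusivity", "corpus_dist",
           "uniform_dist", "eff_num_words", "word-length"]

def parse_diagnostics(xml_text):
    """Return one dict of metric values per topic."""
    root = ET.fromstring(xml_text)
    return [{m: float(t.get(m)) for m in METRICS}
            for t in root.iter("topic")]

topics = parse_diagnostics(DIAGNOSTICS_XML)
print(topics[0]["coherence"])  # -112.3
```

For the real corpora, you would read the diagnostics file from disk (e.g., `ET.parse("diagnostics.xml")`) instead of parsing an inline string.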
Based on our interpretation of the PCA results, we believe LDA topics are distinguished by two primary factors: 1) topic frequency, and 2) top-term specificity. Furthermore, on average, we found common topics with specific top terms score significantly better on coherence scores than uncommon topics with unspecific top terms. However, we also found many poor topics score relatively high on coherence scores. In other words, our results suggest topics that capture the specific, main topics of a corpus should be easier to interpret, but the interpretability of a topic doesn’t imply a topic is specific or central to a corpus.
First, we compare the evaluation metrics associated with each of the data sets. The density plots below show there are clear differences between the diagnostic measures for Facebook and Twitter topics. For example, Facebook has better topic coherence scores, which implies the top terms of each topic co-occur more often in Facebook posts. Likewise, Facebook topics have a larger tokens-per-topic count, which indicates the top terms of each topic occur more frequently in the Facebook corpus.
Prior to performing PCA, we must standardize the data so the metrics have a common scale. We standardize the data so that each measure has a mean of zero and a standard deviation of 1. The density plots below show the distributions of the standardized data.
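Standardization here is a column-wise z-score. A minimal sketch with NumPy, using a synthetic 50-topic-by-6-metric matrix in place of the real diagnostics values:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the topic-by-metric matrix (50 topics x 6 metrics);
# real values would come from the Mallet diagnostics file.
X = rng.normal(loc=[5.0, -200.0, 0.5, 1.0, 2.5, 200.0],
               scale=[2.0, 50.0, 0.1, 0.4, 0.3, 80.0],
               size=(50, 6))

# z-score: subtract each column's mean, divide by its standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(Z.mean(axis=0), 0))  # True
print(np.allclose(Z.std(axis=0), 1))   # True
```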
Correlation analysis of the evaluation metrics shows that we are dealing with highly correlated multivariate data. There is a strong negative correlation between token count and exclusivity; in other words, exclusive topics don’t occur frequently in the corpus. We also see a negative correlation between average word length and the effective number of words, which implies effective terms tend to be shorter. Corpus entropy has a strong positive correlation with exclusivity, whereas coherence has a strong negative correlation with both corpus entropy and exclusivity. In other words, the top terms of coherent topics tend to be words that are common to the corpus, and coherent topics tend to occur more frequently.
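Pairwise relationships like these can be checked with `numpy.corrcoef`. A small sketch with synthetic data constructed to mimic the negative token-count/exclusivity relationship (the variable names are illustrative, not actual Mallet output):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
# Two synthetic metrics built to be negatively correlated, mimicking
# the token-count vs. exclusivity relationship described above.
token_count = rng.normal(size=n)
exclusivity = -0.9 * token_count + 0.3 * rng.normal(size=n)

# Pearson correlation between the two metrics
corr = np.corrcoef(token_count, exclusivity)[0, 1]
print(round(corr, 2))  # strongly negative, near -1
```

On the full data, `np.corrcoef(Z, rowvar=False)` yields the complete metric-by-metric correlation matrix in one call.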
A scree plot of the eigenvalues of the correlation matrix suggests we should retain two principal components (PCs). The general rule of thumb is to keep PCs that are “one less than the elbow” of the scree plot or PCs with an eigenvalue of 1 or greater.
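The eigenvalues behind the scree plot come straight from the correlation matrix, and the "eigenvalue of 1 or greater" rule (the Kaiser criterion) is easy to apply programmatically. A sketch with synthetic standardized data:

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic standardized metric matrix (50 topics x 6 metrics)
Z = rng.normal(size=(50, 6))
Z[:, 1] = 0.8 * Z[:, 0] + 0.2 * rng.normal(size=50)  # induce correlation

R = np.corrcoef(Z, rowvar=False)       # 6x6 correlation matrix
eigvals = np.linalg.eigvalsh(R)[::-1]  # sorted, largest first

# Kaiser criterion: retain components with eigenvalue >= 1
n_retain = int(np.sum(eigvals >= 1.0))
print(eigvals.round(2), n_retain)
```

A useful sanity check: the eigenvalues of a correlation matrix always sum to the number of metrics (here, 6), since its diagonal is all ones.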
The loading matrix below shows token count, corpus entropy, exclusivity, and coherence contribute the most to the 1st PC, which explains 44.6% of the variance based on the eigenanalysis (2.679/6 = 44.6%). Uniform entropy and the effective number of words contribute most to the 2nd PC, which explains 31.5% of the variance. Coherence contributes most to the 3rd PC, which explains 11.4% of the variance.
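The loading matrix and the explained-variance percentages follow directly from the eigendecomposition: loadings scale each eigenvector by the square root of its eigenvalue, and each component's explained variance is its eigenvalue divided by the number of metrics (e.g., 2.679/6 = 44.6%). A sketch, again on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
Z = rng.normal(size=(50, 6))
Z[:, 1] = 0.7 * Z[:, 0] + 0.3 * rng.normal(size=50)

R = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]  # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Loadings: eigenvectors scaled by sqrt(eigenvalue), so each entry is
# the correlation between a metric and a component. The clip guards
# against tiny negative eigenvalues from floating-point error.
loadings = eigvecs * np.sqrt(np.maximum(eigvals, 0))

# Explained variance per component (eigenvalues sum to 6 here)
explained = eigvals / eigvals.sum()
print(explained.round(3))
```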
Based on the loading plots below, the 1st PC appears to capture topic frequency. Exclusivity and corpus entropy are positioned on the far right, implying a topic does not appear often in the corpus; token count is positioned on the far left, implying a topic appears often in the corpus. The 2nd PC appears to capture top-term specificity. The effective number of words is positioned at the bottom, implying the top terms for individual topics are top terms in many topics. Uniform entropy is located at the top and indicates a topic distribution (as well as the top terms of that distribution) does not add much information compared to a uniform distribution.
We examine score plots of the PC values associated with each topic to validate our interpretation of the PCs. The interactive plots below facilitate a more subjective evaluation of specificity and allow us to apply human intuition about which topics would likely be more prevalent in the corpora.
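The score-plot coordinates are the projections of each topic's standardized metrics onto the retained eigenvectors. A minimal sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(4)
Z = rng.normal(size=(50, 6))  # standardized topic-by-metric matrix

R = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvecs = eigvecs[:, order]

# Project each topic onto the first two PCs to get its (x, y)
# position in the score plot.
scores = Z @ eigvecs[:, :2]
print(scores.shape)  # (50, 2)
```

Sizing the plotted points is then just a matter of passing a metric column (token count, effective number of words, coherence) as the marker size.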
Sizing the score plot by token count supports our interpretation that the 1st PC captures topic frequency. Likewise, sizing the points by the effective number of words supports our interpretation that the 2nd PC captures top-term specificity.
The score plot with points sized by coherence scores is less clear. It looks like the more coherent topics (i.e., the points with a smaller diameter) are concentrated more heavily in the top right quadrant and the less coherent topics are concentrated in the bottom left. However, there are many relatively small-sized points scattered throughout all four quadrants. Given that coherence scores are based on top terms co-occurring in documents, the pattern that emerges in the score plot makes sense.
For example, the Twitter topic in the top left includes the following top terms: economy, people, urge, coronavirus, million, debt, package, needed, student, and stimulate. This topic is very coherent. Terms like “stimulate” and “economy” or “student” and “debt” are common word pairings. Hence, it’s very plausible that posts about economic stimulus related to student debt could generate this topic. Likewise, it’s reasonable to think this topic would be relatively less central to a COVID-19 discussion focused on health risks, disease prevention, and government restrictions. In contrast, the Facebook topic in the bottom right includes the following top terms: people, government, crisis, world, pandemic, time, coronavirus, political, country, and public. This topic is not very specific, but the top terms seem like words that would co-occur frequently. This topic also seems like it would be central to the COVID-19 discussion.
Plotting the normalized coherence scores by each quadrant of the score plot allows us to see that the coherence scores are significantly better for common topics with specific top terms. However, for the Twitter topics, the coherence scores of uncommon topics with unspecific top terms overlap heavily with those of common topics with specific top terms. Stated simply, coherence looks like a good measure most of the time, but not always.
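The quadrant comparison amounts to labeling each topic by the signs of its two PC scores and summarizing coherence within each group. A sketch with hypothetical scores and coherence values (the quadrant labels Q1–Q4 are our own naming, counterclockwise from top right):

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical PC scores and coherence values for 50 topics
scores = rng.normal(size=(50, 2))
coherence = rng.normal(size=50)

# Label each topic by its score-plot quadrant:
# Q1 = top right, Q2 = top left, Q3 = bottom left, Q4 = bottom right
quadrant = np.where(scores[:, 0] >= 0,
                    np.where(scores[:, 1] >= 0, "Q1", "Q4"),
                    np.where(scores[:, 1] >= 0, "Q2", "Q3"))

# Mean coherence per quadrant
for q in ("Q1", "Q2", "Q3", "Q4"):
    mask = quadrant == q
    if mask.any():
        print(q, round(coherence[mask].mean(), 3))
```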
So, what is the best metric to evaluate topic quality? It depends.
If the goal is to find topics that are most representative of a corpus of documents, we believe the combination of high token count and high uniform entropy will identify relatively coherent topics. Moreover, we don’t think using coherence alone is prudent. Coherence doesn’t imply a topic is central to a discussion, and it doesn’t imply a topic has a specific focus.
In contrast, if the goal is to quickly surface unique insights that may not be readily apparent, even after reading many documents, then low token count and high uniform entropy may be better suited. The downside is that these topics may require more contextual cues and effort to understand how the top terms are related.